\(\newcommand{\expec}{\mathbb{E}}\) \(\newcommand{\prob}{\mathbb{P}}\) \(\newcommand{\indic}{\mathbb{1}}\) \(\DeclareMathOperator*{\argmax}{arg\,max}\)

In this document we deal with MALA and its variants: Nesterov-accelerated MALA, adaptive MALA, anisotropic MALA and non-reversible MALA. The second part deals with an improvement of the diffusion that takes into account the non-smoothness of the potential we are sampling from.

The Model

We study a classical missing data problem. Our objective is to generate samples from the posterior distribution \(P_{\psi_i|Y_i,\theta}\) for every individual \(i\), at a fixed model parameter \(\theta\).

In this document we consider \(N\) i.i.d. observations \(y=(y_i,\ 1\leqslant i \leqslant N)\), unobserved individual parameters \(\psi=(\psi_i,\ 1\leqslant i \leqslant N)\) and a vector of parameters \(\theta\). The goal is to find the parameter \(\theta\) that maximizes the likelihood \(p(y;\theta)\):

\[p(y;\theta) = \int p(y,\psi;\theta) \, \mathrm{d}\psi\]

We are going to restrict ourselves to models that belong to the exponential family. The complete data log likelihood can be expressed as:

\[\log p(y, \psi; \theta) = -\phi(\theta) + \langle S(y,\psi), \Phi(\theta) \rangle\]

where \(\langle \cdot, \cdot \rangle\) is the scalar product and \(\phi(\theta)\), \(\Phi(\theta)\) and \(S(y,\psi)\) are known functions. We consider a joint model for the observations \(y = (y_i,\ 1\leqslant i \leqslant N)\) and the individual parameters \(\psi = (\psi_i,\ 1\leqslant i \leqslant N)\): \[p(y, \psi; \theta) = p(y|\psi; \theta)\,p(\psi; \theta)\] where \(p(y|\psi; \theta)\) is the conditional distribution of the observations given the individual parameters and \(p(\psi; \theta)\) is the distribution of the individual parameters. Moreover:

\[\begin{split} & y_i = f(t_i;\psi_i)+\sigma\epsilon_i \textrm{ with $\epsilon_i\sim \mathcal{N}(0,1)$}\\ &\psi_i = c_i\,\psi_{pop}+\eta_i\textrm{ with $\eta_i\sim \mathcal{N}(0,\Omega)$} \end{split}\]

\(f\) is a nonlinear function, the solution of a PK-PD ordinary differential equation. As a result we have the following hierarchical model:

\[\begin{split} & y_i|\psi_i \sim \mathcal{N}(f(t_i;\psi_i),\sigma^2)\\ & \psi_i \sim \mathcal{N}(\psi_{pop}, \Omega) \end{split}\]
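To fix ideas, the hierarchical model can be simulated directly. Below is a minimal Python sketch (the analyses in this document use R); the structural function \(f\), the values of \(\psi_{pop}\), \(\Omega\), \(\sigma\) and the observation times are all illustrative assumptions, not the PK-PD model used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population-level values (assumptions, not estimates)
psi_pop, omega, sigma = 2.0, 0.5, 0.1
t = np.linspace(0.0, 5.0, 10)  # observation times for one individual

def f(t, psi):
    # Toy structural model standing in for the PK-PD ODE solution
    return psi * np.exp(-t)

# psi_i ~ N(psi_pop, Omega), then y_i | psi_i ~ N(f(t; psi_i), sigma^2)
psi_i = psi_pop + omega * rng.standard_normal()
y_i = f(t, psi_i) + sigma * rng.standard_normal(t.shape)
```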

MALA and variants

MALA

In this section we describe an MCMC method that constructs a Markov chain based on the discretized Langevin diffusion. Samples are accepted or rejected through a Metropolis-Hastings (MH) step. Let \(\pi\) denote the distribution we want to sample from; in our case it is the conditional density \(p(\psi|y;\theta)\). We consider the overdamped Langevin diffusion:

\[\mathrm{d}X_t = \nabla \log \pi(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t\]

This diffusion is driven by a standard Brownian motion \(W\). It is known that, under mild conditions, this diffusion admits \(\pi\) as stationary distribution. We use the Euler-Maruyama discretization of this diffusion:

\[X_{k+1} = X_k + \gamma\nabla\log\pi(X_k) + \sqrt{2\gamma}\,Z\]

where \(\gamma > 0\) is the discretization step and \(Z \sim \mathcal{N}(0, I)\) is a standard Gaussian vector, drawn independently at each iteration.

The algorithm then consists in drawing a candidate from the Langevin proposal (a Gaussian, implied by the Brownian motion, centered at \(X_k + \gamma\nabla\log\pi(X_k)\) with covariance \(2\gamma I\)) and applying an MH step. In our setting, the chain runs on the random effects \(\eta_i\):

\[\eta_i^k = \eta_i^{k-1}+\gamma\nabla\log\pi(\eta_i^{k-1})+\sqrt{2\gamma}\, Z \quad \textrm{with } Z \sim \mathcal{N}(0,1)\]

And the acceptance probability: \[ \alpha(\eta_i^{k},\eta_i^{k-1}) = \min\left(1,\ \frac{q(\eta_i^{k-1}|\eta_i^{k})\,p_{\eta_i}(\eta_i^{k})\,p_{y_i|\eta_i}(\eta_i^{k})}{q(\eta_i^{k}|\eta_i^{k-1})\,p_{\eta_i}(\eta_i^{k-1})\,p_{y_i|\eta_i}(\eta_i^{k-1})}\right) \] where \(q(\cdot|\cdot)\) is the proposal distribution. Here again, in our implementation the adaptive stepsize \(\gamma\) is updated as follows (cf. Atchadé's adaptive MALA):

\[\gamma \leftarrow \gamma \left(1+\delta\left(\alpha(\eta_i^{k},\eta_i^{k-1})-\alpha^{*}\right)\right)\]

The stationary distribution of this Markov chain is the conditional distribution we intended to sample from.
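Putting the proposal, the MH correction and the stepsize adaptation together, here is a minimal Python sketch of adaptive MALA (the document's experiments use R). The standard-Gaussian target, the initial \(\gamma\), \(\delta = 0.01\) and the target rate \(\alpha^{*} = 0.574\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(x):
    # Standard Gaussian target (illustrative): log pi(x) = -x^2/2 + const
    return -0.5 * x**2

def grad_log_pi(x):
    return -x

def adaptive_mala(n_iter=5000, gamma=0.5, delta=0.01, alpha_star=0.574):
    x = 0.0
    samples = np.empty(n_iter)
    for k in range(n_iter):
        # Langevin proposal: N(x + gamma * grad log pi(x), 2 * gamma)
        mean_fwd = x + gamma * grad_log_pi(x)
        y = mean_fwd + np.sqrt(2.0 * gamma) * rng.standard_normal()
        mean_bwd = y + gamma * grad_log_pi(y)
        # log proposal densities; the normalizing constant cancels because
        # gamma is the same for the forward and backward moves
        log_q_fwd = -((y - mean_fwd) ** 2) / (4.0 * gamma)
        log_q_bwd = -((x - mean_bwd) ** 2) / (4.0 * gamma)
        # MH acceptance probability
        log_alpha = log_pi(y) + log_q_bwd - log_pi(x) - log_q_fwd
        alpha = np.exp(min(0.0, log_alpha))
        if rng.uniform() < alpha:
            x = y
        # Atchade-style adaptation toward the target acceptance rate
        gamma *= 1.0 + delta * (alpha - alpha_star)
        samples[k] = x
    return samples

samples = adaptive_mala()
```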

Theophylline (12 individuals, 10 observations per individual)

As a starting point, the standard approach is to approximate the body by a simple compartment model. In this example we focus on a one-compartment model for theophylline following an oral dose \(D\) at time \(t=0\), leading to a description of the concentration \(y(t_i)\) at times \(t_i \geq 0\) (\(i\) ranges from 1 to \(N\) and indexes the individuals of the population): \[ y_i = y(t_i) = f(\psi_i)+ \epsilon_i \] with:

\[f(\psi_i) = \frac{D(k_a)_i}{V_i((k_a)_i - (C_l)_i/V_i)}\left(e^{-\frac{(C_l)_i}{V_i}t_i}-e^{-(k_a)_it_i}\right)\]

where \((k_a)_i\) is the absorption rate constant for individual \(i\), \((C_l)_i\) the clearance, \(V_i\) the volume of distribution and \(D\) the administered dose. In our notation, the complete model is \(p(y_i,\psi_i;\theta_0)\), where \(\psi_i = ((k_a)_i, (C_l)_i, V_i)\) is the vector of individual parameters. Each component is composed of a fixed effect and a random effect (a centered Gaussian with variance \(\Omega\)), e.g. \((k_a)_i = (k_a)_{pop} + \eta_{(k_a)_i}\) with \(\eta_{(k_a)_i} \sim \mathcal{N}(0,\Omega_0)\), and \(\theta_0 = (\Omega_0, \Sigma_0)\) with \(\epsilon_i \sim \mathcal{N}(0,\Sigma_0)\). Our goal is to simulate, for instance, from the posterior distribution \(P((k_a)_i|y_i,\theta_0)\). As noted above, we can equivalently work on the distribution \(P(\eta_{(k_a)_i}|y_i,\theta_0)\), and likewise for the other parameters.
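As a hedged illustration, the structural function \(f\) can be coded directly. This Python sketch uses the standard one-compartment oral-dose formula with elimination rate \(k_e = C_l/V\); the parameter values below are assumptions, not estimates from the theophylline data.

```python
import numpy as np

def f_theo(t, ka, Cl, V, D):
    # One-compartment model with first-order absorption:
    # f = D*ka / (V*(ka - ke)) * (exp(-ke*t) - exp(-ka*t)), with ke = Cl/V
    ke = Cl / V
    return D * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Illustrative parameter values only
t = np.linspace(0.0, 24.0, 10)
conc = f_theo(t, ka=1.5, Cl=0.04, V=0.5, D=4.0)
```

The concentration starts at 0 at the dosing time and stays nonnegative whenever \(k_a > k_e\).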

Posterior samples for all individuals

Autocorrelations and MSJD

In terms of autocorrelation and mean squared jump distance (MSJD) we get:

index=4
print(paste0("msjd rwm: ", mssd(post_rwm[[index]][,3])))
## [1] "msjd rwm: 0.00908121502634136"
print(paste0("msjd mala: ", mssd(post_mala[[index]][,3])))
## [1] "msjd mala: 0.00769192366098725"

Variance of the estimator

20 replicates of RWM and MALA are run for 1000 iterations. Only the last 400 iterations are used to compare the variances of the estimator.

Cow (560 individuals, 5455 observations)

\[ y_i = y(t_i) = f(\psi_i)+ \epsilon_i \] with:

\[f(\psi_i) = a_i\left(1-\frac{b_i}{1000}\,e^{-k_i x/100000}\right)\]

With \(\psi_i = (a_i,b_i,k_i)\)
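For completeness, this growth curve can be sketched in Python as well; the parameter values below are illustrative assumptions, and \(x\) denotes the covariate from the formula above.

```python
import numpy as np

def f_cow(x, a, b, k):
    # Growth model from the text: f = a * (1 - (b/1000) * exp(-k * x / 100000))
    return a * (1.0 - (b / 1000.0) * np.exp(-k * x / 100000.0))

# Illustrative values only
y = f_cow(x=np.array([0.0, 1000.0, 5000.0]), a=700.0, b=900.0, k=10.0)
# y[0] = 700 * (1 - 0.9) = 70; the curve then increases toward a = 700
```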

Nesterov acceleration

Inspired by Nesterov's acceleration of gradient descent, we add a memory term to the discretized overdamped Langevin diffusion:

\[X_{k+1} = X_k + \gamma\nabla\log\pi(X_k) +R(X_{k} - X_{k-1}) + \sqrt{2\gamma}\,Z\]

where \(R>0\) is the memory coefficient.
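A minimal Python sketch of this recursion on a one-dimensional standard-Gaussian target (an illustrative choice, as are the values of \(\gamma\) and \(R\)); in the actual algorithm each proposal would still go through the MH step described in the MALA section:

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_log_pi(x):
    # Standard Gaussian target (illustrative): log pi(x) = -x^2/2 + const
    return -x

gamma, R = 0.1, 0.3      # stepsize and memory coefficient (assumed values)
x_prev, x = 0.0, 0.5
traj = np.empty(1000)
for k in range(1000):
    # X_{k+1} = X_k + gamma*grad + R*(X_k - X_{k-1}) + sqrt(2*gamma)*Z
    x_new = (x + gamma * grad_log_pi(x)
             + R * (x - x_prev)
             + np.sqrt(2.0 * gamma) * rng.standard_normal())
    x_prev, x = x, x_new
    traj[k] = x
```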

Theo case

Posterior samples for all individuals

Autocorrelations and MSJD

In terms of autocorrelation and mean squared jump distance (MSJD) we get:

index=4
print(paste0("msjd rwm: ", mssd(post_rwm[[index]][,3])))
## [1] "msjd rwm: 0.0082680547664603"
print(paste0("msjd mala: ", mssd(post_mala[[index]][,3])))
## [1] "msjd mala: 0.00818262145854394"
print(paste0("msjd nesterov: ", mssd(post_nest[[index]][,3])))
## [1] "msjd nesterov: 0.00769192366098725"

Variance of the estimator

Anisotropic MALA

In this section we consider the regular Langevin-diffusion Markov chain and let the stepsize of the drift term depend on the direction of the diffusion. The stepsize of the Brownian motion, i.e. the covariance of the proposal, also depends on the gradient direction.

\[X_{k+1} = X_k + \gamma_1\nabla\log\pi(X_k) + \sqrt{2\gamma_2}\,Z\]

where: \[\gamma_1 = \frac{b}{\max(b, \|\nabla \log \pi(X_k)\|)}\] with \(b>0\) a threshold term, and: \[\gamma_2 = \gamma_1^2\, \nabla \log \pi(X_k)^{T} \nabla \log \pi(X_k) = \gamma_1^2\,\|\nabla \log \pi(X_k)\|^2\]
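The two stepsizes can be computed as follows; a minimal Python sketch in which the threshold value \(b\) and the gradient are illustrative:

```python
import numpy as np

def anisotropic_steps(grad, b=1.0):
    # gamma_1 = b / max(b, ||grad||): shrinks the drift when the gradient is large
    g_norm = np.linalg.norm(grad)
    gamma1 = b / max(b, g_norm)
    # gamma_2 = gamma_1^2 * grad^T grad = gamma_1^2 * ||grad||^2
    gamma2 = gamma1**2 * float(grad @ grad)
    return gamma1, gamma2

g1, g2 = anisotropic_steps(np.array([3.0, 4.0]), b=1.0)  # ||grad|| = 5
# g1 = 1/5 = 0.2 and g2 = 0.04 * 25 = 1.0
```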

Theo case

Posterior samples for all individuals

Autocorrelations and MSJD

In terms of autocorrelation and mean squared jump distance (MSJD) we get:

Variance of the estimator

Non reversible MALA

Finally, another method is implemented in order to improve the autocorrelation and variance properties, following R. Poncet's generalized MALA and Neal's paper "Non reversible chains are better" (https://arxiv.org/pdf/math/0407281.pdf).

Posterior samples for all individuals

Autocorrelations and MSJD

In terms of autocorrelation and mean squared jump distance (MSJD) we get:

Variance of the estimator

Moreau-Yosida MALA (MAMYULA)

In all of our cases, the potential is composed of a smooth part and a non-smooth part. The idea is to replace the non-smooth part by its Moreau-Yosida envelope, a smooth approximation whose gradient can be computed through the proximal operator, and to run the Langevin scheme on the resulting smoothed potential.
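As a hedged illustration of the Moreau-Yosida idea on the standard example \(g(x) = |x|\) (the document's actual non-smooth parts may differ): the envelope \(g^{\lambda}\) is smooth, and its gradient is \(\bigl(x - \mathrm{prox}_{\lambda g}(x)\bigr)/\lambda\), which here reduces to soft thresholding.

```python
import numpy as np

def prox_abs(x, lam):
    # Proximal operator of g(x) = |x|: soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def grad_my_envelope(x, lam):
    # Gradient of the Moreau-Yosida envelope of |x|:
    # (x - prox_{lam*g}(x)) / lam, smooth and (1/lam)-Lipschitz
    return (x - prox_abs(x, lam)) / lam

g = grad_my_envelope(np.array([2.0, -0.05, 0.0]), lam=0.1)
# For |x| >= lam the gradient is sign(x); inside the threshold it is x/lam
```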

Theo case

Posterior samples for all individuals

Autocorrelations and MSJD

In terms of autocorrelation and mean squared jump distance (MSJD) we get:

SAEM with MALA and MAMYULA

Theo case

Variance of the estimator


Cow case

Variance of the estimator
